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Abstract 

Variational inference is a scalable technique for approximate Bayesian inference. 
Deriving variational inference algorithms requires tedious model-specific calculations; 
this makes it difficult to automate. We propose an automatic variational inference al¬ 
gorithm, automatic differentiation variational inference (advi). The user only provides 
a Bayesian model and a dataset; nothing else. We make no conjugacy assumptions and 
support a broad class of models. The algorithm automatically determines an appropri¬ 
ate variational family and optimizes the variational objective. We implement advi in 
Stan (code available now), a probabilistic programming framework. We compare advi 
to mcmc sampling across hierarchical generalized linear models, nonconjugate matrix 
factorization, and a mixture model. We train the mixture model on a quarter million 
images. With advi we can use variational inference on any model we write in Stan. 


1 Introduction 

Bayesian inference is a powerful framework for analyzing data. We design a model for data 
using latent variables; we then analyze data by calculating the posterior density of the latent 
variables. For machine learning models, calculating the posterior is often difficult; we resort 
to approximation. 
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Seconds 



Seconds 


(a) Subset of 1000 images 


(b) Full dataset of 250 000 images 


Figure 1: Held-out predictive accuracy results | GMM of the imageCLEF image histogram 
dataset, (a) ADVI outperforms the NUTS, the default sampling method in Stan 0- (b) 
ADVI scales to large datasets by subsampling minibatches of size B from the dataset at each 

and Appendix [jj 


iteration |3j. We present more details in Section 


3.3 


Variational inference (vi) approximates the posterior with a simpler density [I] El- We 
search over a family of simple densities and find the member closest to the posterior. This 
turns approximate inference into optimization. VI has had a tremendous impact on machine 
learning; it is typically faster than Markov chain Monte Carlo (mcmc) sampling (as we show 
here too) and has recently scaled up to massive data [3]. 

Unfortunately, VI algorithms are difficult to derive. We must first define the family of 
approximating densities, and then calculate model-specific quantities relative to that family 
to solve the variational optimization problem. Both steps require expert knowledge. The 
resulting algorithm is tied to both the model and the chosen approximation. 

In this paper we develop a method for automating variational inference, automatic dif¬ 
ferentiation variational inference (advi). Given any model from a wide class (specifically, 
differentiable probability models), ADVI determines an appropriate variational family and 
an algorithm for optimizing the corresponding variational objective. We implement ADVI in 
Stan [4], a flexible probabilistic programming framework originally designed for sampling- 
based inference. Stan describes a high-level language to define probabilistic models (e.g., 
Figure [2| as well as a model compiler, a library of transformations, and an efficient auto¬ 
matic differentiation toolbox. With ADVI we can now use variational inference on any model 
we can express in StanQ (See Appendices [f] to [j]) 

Figure [l] illustrates the advantages of our method. We present a nonconjugate Gaussian 
mixture model for analyzing natural images; this is 40 lines in Stan (Figure [l0|. Section la 
illustrates Bayesian inference on 1000 images. The y-axis is held-out likelihood, a measure 
of model fitness; the x-axis is time (on a log scale). ADVI is orders of magnitude faster than 
NUTS, a state-of-the-art MCMC algorithm (and Stan’s default inference technique) [5]. We 
also study nonconjugate factorization models and hierarchical generalized linear models; we 
consistently observe speed-up against NUTS. 

Section [lbjillustrates Bayesian inference on 250 000 images, the size of data we more com¬ 
monly find in machine learning. Here we use ADVI with stochastic variational inference [3] , 
giving an approximate posterior in under two hours. For data like these, MCMC techniques 
cannot even practically begin analysis, a motivating case for approximate inference. 

Related Work. ADVI automates variational inference within the Stan probabilistic 
programming framework [J. This draws on two major themes. 

The first is a body of work that aims to generalize VI. Kingma and Welling |B and 
Rezende et al. (7 describe a reparameterization of the variational problem that simplifies 
optimization. Ranganath et al. |5j and Salimans and Knowles [9. propose a black-box 


^advi is available in Stan 2.7 (development branch). It will appear in Stan 2.8. See Appendix [c| 
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technique that only uses the gradient of the approximating family for optimization. Titsias 
and Lazaro-Gredilla m leverage the gradient of the model for a small class of models. We 
build on and extend these ideas to automate variational inference; we highlight technical 
connections as we develop our method. 

The second theme is probabilistic programming. Wingate and Weber H2 study VI in 
general probabilistic programs, as supported by languages like Church m, Venture m , 
and Anglican m- Another probabilistic programming framework is infer.NET, which im¬ 
plements variational message passing EE an efficient algorithm for conditionally conjugate 
graphical models. Stan supports a more comprehensive class of models that we describe in 
Section 12.11 


2 Automatic Differentiation Variational Inference 

Automatic differentiation variational inference (advi) follows a straightforward recipe. First, 
we transform the space of the latent variables in our model to the real coordinate space. 
For example, the logarithm transforms a positively constrained variable, such as a standard 
deviation, to the real line. Then, we posit a Gaussian variational distribution. This induces 
a non-Gaussian approximation in the original variable space. Last, we combine automatic 
differentiation with stochastic optimization to maximize the variational objective. We begin 
by defining the class of models we support. 


2.1 Differentiable Probability Models 


Consider a dataset X = xi ; jv with N observations. Each x n is a discrete or continuous 
random vector. The likelihood p(X | 9) relates the observations to a set of latent random 
variables 9. Bayesian analysis posits a prior density p{9) on the latent variables. Combining 
the likelihood with the prior gives the joint density p(X, 9) = p(X | 9) p(9). 

We focus on approximate inference for differentiable probability models. These models 
have continuous latent variables 9. They also have a gradient of the log-joint with respect 
to the latent variables Vg logp(X, 9). The gradient is valid within the support of the prior 
supp(p(0)) = { 9 | 9 G iV and p(9) > 0 } C IV, where K is the dimension of the latent 
variable space. This support set is important: it determines the support of the posterior 
density and will play an important role later in the paper. Note that we make no assumptions 
about conjugacy, either fulQor conditional]^] 

Consider a model that contains a Poisson likelihood with unknown rate, p[x | A). The 
observed variable x is discrete; the latent rate A is continuous and positive. Place an 
exponential prior for A, defined over the positive real numbers. The resulting joint den¬ 
sity describes a nonconjugate differentiable probability model. (See Figure !) Its par¬ 
tial derivative d/d\p{x, A) is valid within the support of the exponential distribution, 
supp(p(A)) = R + C R. Since this model is nonconjugate, the posterior is not an expo¬ 
nential distribution. This presents a challenge for classical variational inference. We will see 
how ADVI handles this model later in the paper. 

Many machine learning models are differentiable probability models. Linear and logistic 
regression, matrix factorization with continuous or discrete measurements, linear dynamical 
systems, and Gaussian processes are prime examples. In machine learning, we usually 
describe mixture models, hidden Markov models, and topic models with discrete random 
variables. Marginalizing out the discrete variables reveals that these are also differentiable 
probability models. (We show an example in Section 3.3 ) Only fully discrete models, such 
as the Ising model, fall outside of this category. 


2 The posterior of a fully conjugate model is in the same family as the prior. 

3 A conditionally conjugate model has this property within the complete conditionals of the model [5]. 
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Aq — 1 


data { 
int N; 

int x[N]; // discrete - valued observations 

> 

parameters { 

// latent variable , must be positive 
real <lower=0> lambda; 

> 

model { 

// non-conjugate prior for latent variable 
lambda ~ exponential ( 1.0 ) ; 

// likelihood 
for (n in 1:N) 

increment_log_prob ( poisson_log (x [n] , lambda)) ; 

> 

Figure 2: Specifying a simple nonconjugate probability model in Stan. 



2.2 Variational Inference 

In Bayesian inference, we seek the posterior density p(6 | X), which describes how the 
latent variables vary, conditioned on a set of observations X. Many posterior densities are 
intractable because they lack analytic (closed-form) solutions. Thus, we seek to approximate 
the posterior. 

Consider an approximating density q(6 ; </>) parameterized by </>. We make no assump¬ 
tions about its shape or support. We want to find the parameters of q{0 ; </>) to best match 
the posterior according to some loss function. Variational inference (vi) minimizes the 
Kullback-Leibler (kl) divergence, 

minKL(g(0;0)||j>(0|X)), (1) 

0 

from the approximation to the posterior f2]. Typically the KL divergence also lacks an 
analytic form. Instead we maximize a proxy to the KL divergence, the evidence lower bound 
(elbo) 


C{ct>) = E, (9) [ logp(X, 0)] - E, (9) [ log q(0 ; </>)]. 

The first term is an expectation of the joint density under the approximation, and the second 
is the entropy of the variational density. Maximizing the ELBO minimizes the KL divergence 

mm- 

The minimization problem from Equation [T] becomes 

cj)* = argmax£((/>) such that supp(g(0 ;</>)) C supp(p(0 | X)), (2) 

0 

where we explicitly specify the support matching constraint implied in the KL divergence]^] 
We highlight this constraint, as we do not specify the form of the variational approximation; 
thus we must ensure that q{0 ; <j>) stays within the support of the posterior, which is equal 
to the support of the prior. 

Why is VI difficult to automate? In classical variational inference, we typically design 
a conditionally conjugate model; the optimal approximating family matches the prior, which 
satisfies the support constraint by definition m- In other models, we carefully study the 
model and design custom approximations. These depend on the model and on the choice of 
the approximating density. 

One way to automate VI is to use black-box variational inference 0E- If we select a 
density whose support matches the posterior, then we can directly maximize the ELBO using 
Monte Carlo (mc) integration and stochastic optimization. Another strategy is to restrict 
the class of models and use a fixed variational approximation m- For instance, we may use 

4 If supp(g) 2 supp(p) then outside the support of p we have KL (q || p) = fy[log q] — E q [logp] = — oc. 


4 




a Gaussian density for inference in unrestrained differentiable probability models, i.e. where 
supp (p(0)) = RA 

We adopt a transformation-based approach. First, we automatically transform the sup¬ 
port of the latent variables in our model to the real coordinate space. Then, we posit a 
Gaussian variational density. The inverse of our transform induces a non-Gaussian vari¬ 
ational approximation in the original variable space. The transformation guarantees that 
the non-Gaussian approximation stays within the support of the posterior. Here is how it 
works. 


2.3 Automatic Transformation of Constrained Variables 

Begin by transforming the support of the latent variables 6 such that they live in the real 
coordinate space M. K . Define a one-to-one differentiable function 

T : supp(p(0)) -A R*, (3) 


and identify the transformed variables as £ = T(0). The transformed joint density g(X, () 
is a function of C; it has the representation 


5 (X,C)=p(X,T- 1 (C))|detJ T - 1 (C) 


where p is the joint density in the original latent variable space, and J T -1 (£) is the Jacobian 
of the inverse of T. Transformations of continuous probability densities require a Jacobian; 
it accounts for how the transformation warps unit volumes |17j . (See Appendix |D|) 

Consider again our running example. The rate A lives in R + . The logarithm £ = 
T(A) = log(A) transforms R + to the real line R. Its Jacobian adjustment is the deriva¬ 
tive of the inverse of the logarithm, | det I = ex P(0- The transformed density is 

g(x,() = Poisson (a? | exp(£)) Exponential(exp(£)) exp(£). Figures 3a and 3b depict this 
transformation. 

As we describe in the introduction, we implement our algorithm in Stan to enable generic 
inference. Stan implements a model compiler that automatically handles transformations. It 
works by applying a library of transformations and their corresponding Jacobians to the joint 
model density^] This transforms the joint density of any differentiable probability model to 
the real coordinate space. Now, we can choose a variational distribution independent from 
the model. 



(b) Real coordinate space (c) Standardized space 

(a) Latent variable space 


Figure 3: Transformations for ADVI. The purple line is the posterior. The green line is 
the approximation, (a) The latent variable space is R + . (a—>-b) T transforms the latent 
variable space to R. (b) The variational approximation is a Gaussian, (b—>c) S'^ iU! absorbs 
the parameters of the Gaussian, (c) We maximize the elbo in the standardized space, with 
a fixed standard Gaussian approximation. 


5 Stan provides transformations for upper and lower bounds, simplex and ordered vectors, and structured 
matrices such as covariance matrices and Cholesky factors |4j. 
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2.4 Implicit Non-Gaussian Variational Approximation 

After the transformation, the latent variables ( have support on . We posit a diagonal 
(mean-field) Gaussian variational approximation 


K 


q(C ; 4>) = V(C; n, <j 2 ) = A f{C , k ; Hk, 


fc=i 


where the vector <f> = (/zi, • • • , /z^, cr 2 , • • • , a concatenates the mean and variance of each 
Gaussian factor. This defines our variational approximation in the real coordinate space. 


(Figure 3b ) 


The transformation T from Equation [3] maps the support of the latent variables to 
the real coordinate space. Thus, its inverse T -1 maps back to the support of the latent 
variables. This implicitly defines the variational approximation in the original latent variable 
space as J\f(T~ 1 {C t ); /z, <r 2 )| det J T -i (£) | ■ The transformation ensures that the support of this 
approximation is always bounded by that of the true posterior in the original latent variable 
space (Figure [3a|. Thus we can freely optimize the ELBO in the real coordinate space (Figure 
3b I without worrying about the support matching constraint. 

The ELBO in the real coordinate space is 


£(/z,fj 2 ) = E, (c) 


logp(X,T (C)) + log detJ T -i«) 


K 


K 


+ y (1 + log(27r)) logcr fc , 


k =1 


where we plug in the analytic form for the Gaussian entropy. (Derivation in Appendix |A| ) 
We choose a diagonal Gaussian for its efficiency and analytic entropy. This choice may 
call to mind the Laplace approximation technique, where a second-order Taylor expansion 
around the maximum-a-posteriori estimate gives a Gaussian approximation to the poste¬ 
rior. However, using a Gaussian variational approximation is not equivalent to the Laplace 
approximation CHI. Our approach is distinct in another way: the posterior approximation 
in the original latent variable space (Figure 3a I is non-Gaussian, because of the inverse 
transformation T~ l and its Jacobian. 


2.5 Automatic Differentiation for Stochastic Optimization 

We now seek to maximize the ELBO in real coordinate space, 

/.<*, a 2 * = argmax£(/z, cr 2 ) such that a 2 >- 0. (4) 


We can use gradient ascent to reach a local maximum of the ELBO. Unfortunately, we cannot 
apply automatic differentiation to the ELBO in this form. This is because the expectation 
defines an intractable integral that depends on /z and cr 2 ; we cannot directly represent it 
as a computer program. Moreover, the variance vector cr 2 must remain positive. Thus, we 
employ one final transformation: elliptical standardization |19), shown in Figures 3b and 


First, re-parameterize the Gaussian distribution with the log of the standard deviation, 
uj = log(cr), applied element-wise. The support of u is now the real coordinate space and a is 
always positive. Then, define the standardization 77 = (0 = diag(exp(w^ 1 ))(^'— n). The 

standardization encapsulates the variational parameters; in return it gives a fixed variational 
density 


K 

q(v ! 0,1) = Af{r }; 0,1) = N(j) k ; 0,1). 

k =1 

6 Also known as a “co-ordinate transformation” [7], an “invertible transformation” m, and the “re¬ 
parameterization trick” [6]. 
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Algorithm 1: Automatic Differentiation Variational Inference 

Input: Dataset X = xi : jv, model p(X.,d). 

Set iteration counter * = 0 and choose a stepsize sequence p M. 

Initialize p^ = 0 and = 0. 

while change in ELBO is above some threshold do 

Draw M samples 7 ? m ~ A/"(0,1) from the standard multivariate Gaussian. 

Invert the standardization £ m = diag(exp(u/*)))? 7 m + p^\ 

Approximate V M £ and using MC integration (Equations [s] and |gJ) . 
Update /x (i+1) 4— and uA i+1 ) <— w® + p (i) V w £. 

Increment iteration counter. 

end 

Return p* <— p ^ and u* <— w (i) . 


The standardization transforms the variational problem from Equation [4] into 
p*,co* = argmax£(/r, w) 

K 

+ 

k =l 

where we drop independent term from the calculation. The expectation is now in terms of 
the standard Gaussian, and both parameters p and to are unconstrained. (Figure |3cj) We 
push the gradient inside the expectations and apply the chain rule to get 

V M £ = Ejv (? 7 ) [V 9 logp(X ) fl)V c T- 1 (C)+V c log|det J r -i(C)|] , (5) 

V bJk C=E M{rik) [(V efc logp(X, 6 »)V a T- 1 (C) + V a log|det J T -i(C)|)? 7 fcexp(w fc )] +1. ( 6 ) 
(Derivations in Appendix |Bj) 

We can now compute the gradients inside the expectation with automatic differentiation. 
This leaves only the expectation. MC integration provides a simple approximation: draw 
M samples from the standard Gaussian and evaluate the empirical mean of the gradients 
within the expectation [20] ■ This gives unbiased noisy estimates of gradients of the ELBO. 

2.6 Scalable Automatic Variational Inference 

Equipped with unbiased noisy gradients of the ELBO, ADVI implements stochastic gradient 
ascent. (Algorithm [l] ) We ensure convergence by choosing a decreasing step-size schedule. 
In practice, we use an adaptive schedule m with finite memory. (See Appendix [E] for 
details.) 

ADVI has complexity 0(2NMK) per iteration, where M is the number of MC samples 
(typically between 1 and 10). Coordinate ascent VI has complexity 0(2NK) per pass over 
the dataset. We scale ADVI to large datasets using stochastic optimization HUS]. The 
adjustment to Algorithm [T] is simple: sample a minibatch of size B <C from the dataset 
and scale the likelihood of the model by N/B |3]. The stochastic extension of ADVI has a 
per-iteration complexity 0(2BMK). 


arg max E^,,. o,i) 

fl,0J 


logp(X ,T 1 {S li Ui 7 ))) + log I det J T -1 (S' * fa)) 


3 Empirical Study 

We now study ADVI across a variety of models. We compare its speed and accuracy to two 
Markov chain Monte Carlo (mcmc) sampling algorithms: Hamiltonian Monte Carlo (hmc) 
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Seconds Seconds 

(a) Linear Regression with ard (b) Hierarchical Logistic Regression 

Figure 4: Hierarchical Generalized Linear Models. 

[21] and the no-U-turn sampler (nuts)[^] fS]. We assess ADVI convergence by tracking the 
ELBO; assessing convergence with MCMC techniques is less straightforward. To place ADVI 
and MCMC on a common scale, we report predictive accuracy on held-out data as a function 
of time. We approximate the Bayesian posterior predictive using MC integration. For the 
MCMC techniques, we plug in posterior samples into the likelihood. For ADVI, we do the 
same by drawing a sample from the posterior approximation at fixed intervals during the 
optimization. We initialize ADVI with a draw from a standard Gaussian. 

We explore two hierarchical regression models, two matrix factorization models, and a 
mixture model. All of these models have nonconjugate prior structures. We conclude by 
analyzing a dataset of 250 000 images, where we report results across a range of minibatch 
sizes B. 

3.1 A Comparison to Sampling: Hierarchical Regression Models 

We begin with two nonconjugate regression models: linear regression with automatic rele¬ 
vance determination (ard) pD5| and hierarchical logistic regression [22, ■ 

Linear Regression with ard. This is a sparse linear regression model with a hierar¬ 
chical prior structure. (Details in Appendix |Fj) We simulate a dataset with 250 regressors 
such that half of the regressors have no predictive power. We use 10 000 training samples 
and hold out 1000 for testing. 

Logistic Regression with Spatial Hierarchical Prior. This is a hierarchical logistic 
regression model from political science. The prior captures dependencies, such as states 
and regions, in a polling dataset from the United States 1988 presidential election [22. 
The model is nonconjugate and would require some form of approximation to derive a VI 
algorithm. (Details in Appendix [G] ) 

We train using 10 000 data point and withhold 1536 for evaluation. The regressors 
contain age, education, and state and region indicators. The dimension of the regression 
problem is 145. 

Results. Figure [4] plots average log predictive accuracy as a function of time. For these 
simple models, all methods reach the same predictive accuracy. We study ADVI with two 
settings of M, the number of MC samples used to estimate gradients. A single sample per 
iteration is sufficient; it also is the fastest. (We set M = 1 from here on.) 

3.2 Exploring nonconjugate Models: Non-negative Matrix Factor¬ 
ization 

We continue by exploring two nonconjugate non-negative matrix factorization models: a 
constrained Gamma Poisson model [2 H and a Dirichlet Exponential model. Here, we show 

7 nuts is an adaptive extension of hmc. It is the default sampler in Stan. 
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(a) Gamma Poisson Predictive Likelihood (b) Dirichlet Exponential Predictive Likeli¬ 
hood 



(c) Gamma Poisson Factors (d) Dirichlet Exponential Factors 


Figure 5: Non-negative matrix factorization of the Frey Faces dataset. 


how easy it is to explore new models using ADVI. In both models, we use the Frey Face 
dataset, which contains 1956 frames (28 x 20 pixels) of facial expressions extracted from a 
video sequence. 

Constrained Gamma Poisson. This is a Gamma Poisson factorization model with 
an ordering constraint: each row of the Gamma matrix goes from small to large values. 
(Details in Appendix |H[) 

Dirichlet Exponential. This is a nonconjugate Dirichlet Exponential factorization 
model with a Poisson likelihood. (Details in Appendix [I]) 

Results. Figure [5] shows average log predictive accuracy as well as ten factors recovered 
from both models. ADVI provides an order of magnitude speed improvement over NUTS 
(Figure [5a|. NUTS struggles with the Dirichlet Exponential model (Figure 5b I. In both 
cases, HMC does not produce any useful samples within a budget of one hour; we omit HMC 
from the plots. 

The Gamma Poisson model (Figure |5c| appears to pick significant frames out of the 
dataset. The Dirichlet Exponential factors (Figure [5d| are sparse and indicate components 
of the face that move, such as eyebrows, cheeks, and the mouth. 


3.3 Scaling to Large Datasets: Gaussian Mixture Model 

We conclude with the Gaussian mixture model (gmm) example we highlighted earlier. This 
is a nonconjugate GMM applied to color image histograms. We place a Dirichlet prior on 
the mixture proportions, a Gaussian prior on the component means, and a lognormal prior 
on the standard deviations. (Details in Appendix [j]) We explore the imageCLEF dataset, 
which has 250 000 images [25] ■ We withhold 10 000 images for evaluation. 

In Figure [la] we randomly select 1000 images and train a model with 10 mixture compo¬ 
nents. NUTS struggles to find an adequate solution and HMC fails altogether. This is likely 
due to label switching, which can affect HMC-based techniques in mixture models |3] . 

Figure [Tb] shows ADVI results on the full dataset. Here we use ADVI with stochastic 
subsampling of minibatches from the dataset [3]. We increase the number of mixture com¬ 
ponents to 30. With a minibatch size of 500 or larger, ADVI reaches high predictive accuracy. 
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Smaller minibatch sizes lead to suboptimal solutions, an effect also observed in (3]. ADVI 
converges in about two hours. 


4 Conclusion 

We develop automatic differentiation variational inference (advi) in Stan. ADVI leverages 
automatic transformations, an implicit non-Gaussian variational approximation, and auto¬ 
matic differentiation. This is a valuable tool. We can explore many models, and analyze 
large datasets with ease. We emphasize that ADVI is currently available as part of Stan; it 
is ready for anyone to use. 
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A Transformation of the Evidence Lower Bound 


Recall that £ = T(6) and that the variational approximation in the real coordinate space is 
<?(C; MW 2 )- 

We begin with the evidence lower bound (elbo) in the original latent variable space. 
We then transform the latent variable space of to the real coordinate space. 


r = /,( S ;«.o g W x -"> 

= Jq( C; mw 2 )log 


,g(0; </>)J 

p(x,r- 1 (0)|detJ T - 1 (C)l 


dc 


9(C; mw 2 ) 

= J q{ C; mw 2 ) log [p(x,t _ 1 (0)| det J r -i(C)|] dC- J q( C; mw 2 ) log [<?(C; mw 2 )] dc 

= E ?(C) [logpCX,! 7-1 ^)) + log | det J T -i(C)|] E g(C) [l°S9(C ! MW 2 )] 


The variational approximation in the real coordinate space is a Gaussian. Plugging in its 
entropy gives the ELBO in the real coordinate space 


£ = E g(C) [logp(X,T X (C)) + log | det Jt-i(C)|] + ^(1+ log(27r)) + ^logcr*. 

k =1 


B Gradients of the Evidence Lower Bound 

First, consider the gradient with respect to the p parameter of the standardization. We 
exchange the order of the gradient and the integration through the dominated convergence 
theorem [26]. The rest is the chain rule for differentiation. 

= V^E^.qj) [logp(X,T- 1 (5“i ; (??))) +log|det J t - 1 (S , “J j (jj))|] 

K K 

+ y(l + log(27r)) log cr fc | 

k =1 

= Ev ( „ ; o.i) [V„ {logp(X, T-^S- 1 (77)) + log | det J T -1 (S" 1 ^)) |}] 

= Ev ( ^ ; o,i) [V fl logp(X, 0)V^T _1 (C)V (rj) + V c log | det J r -x (C) | V^ifa)] 

= ®V (??; 0 . 1 ) [Ve logp(X, 0)V C T- 1 (C) + V c log | det J T -i (C) |] 

Similarly, consider the gradient with respect to the ui parameter of the standardization. 
The gradient with respect to a single component, u>k, has a clean form. We abuse the V 
notation to maintain consistency with the rest of the text (instead of switching to d). 

V Wfc £ = V^jE^. 0 , 1 ) [logp(X,+ log | det J T ~i ( 7 ?))|] 

K K 

+ y (! + log(27r)) + ^ log(exp(w fe ))| 

k =1 

= Ejv ( ? 7 fc) [V„Jlogp(X,r- 1 (S-i(f,)))+log|detJ r - 1 (5-i,(» ? ))|}] +1 
= E^ ( , fc) [(V flfc logp(X,0)V a T- 1 (C) + V Cfc log | det J T -i(C)|) V^S;»)] + 1. 

= E At(^) [(v flfc logp(X, 0)V a T _1 (C) + v a log | det J T -1 (01) Vk exp(w fc )] + 1. 
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C Running ADVI in Stan 

Use git to checkout the feature/bbvb branch from https://github.com/stan-dev/stan. 
Follow instructions to build Stan. Then download cmdStan from https://github.com/stan- 
dev/cmdstan. Follow instructions to build cmdStan and compile your model. You are then 
ready to run ADVI. 

The syntax is 

./myModel experimental variational 

grad_samples=M ( M = 1 default ) 

data flle=myData.data.R 

output file=output_advi.csv 

diagnostic_file=elb o_advi. cs v 

where myData.data.R is the dataset in the R language dump format, output_advi.csv 

contains samples from the posterior and elbo_advi.csv reports the ELBO. 


D Transformations of Continuous Probability Densities 


We present a brief summary of transformations, largely based on nz]. 

Consider a univariate (scalar) random variable X with probability density function 
fx(x). Let X = supp (fx{x)) be the support of X. Now consider another random vari¬ 
able Y defined as Y = T{X). Let y = supp(/y (;(/)) be the support of Y. 

If T is a one-to-one and differentiable function from X to y, then Y has probability 
density function 


fv(y) = fx(T- 1 (y)) 


d T-\y) 
d y 


Let us sketch a proof. Consider the cumulative density function Y. If the transformation 
T is increasing, we directly apply its inverse to the cdf of Y. If the transformation T is 
decreasing, we apply its inverse to one minus the cdf of Y. The probability density function 
is the derivative of the cumulative density function. These things combined give the absolute 
value of the derivative above. 

The extension to multivariate variables X and Y requires a multivariate version of the 
absolute value of the derivative of the inverse transformation. This is the the absolute 
determinant of the Jacobian, | det J T -i (Y)| where the Jacobian is 


J T - i(Y) = 


dy i 

dyK 

dT- 1 

dT- 1 

dy i 

dyK 


Intuitively, the Jacobian describes how a transformation warps unit volumes across 
spaces. This matters for transformations of random variables, since probability density 
functions must always integrate to one. If the transformation is linear, then we can drop 
the Jacobian adjustment; it evaluates to one. Similarly, affine transformations, like elliptical 
standardizations, do not require Jacobian adjustments; they preserve unit volumes. 


E Setting a Stepsize Sequence for ADVI 

We use adaGrad m to adaptively set the stepsize sequence in ADVI. While adaGrad offers 
attractive convergence properties, in practice it can be slow because it has infinite memory. 
(It tracks the norm of the gradient starting from the beginning of the optimization.) In 
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ADVI we randomly initialize the variational approximation, which can be far from the true 
posterior. This makes adaGrad take very small steps for the rest of the optimization, thus 
slowing convergence. Limiting adaGrad’s memory speeds up convergence in practice, an 
effect also observed in training neural networks |5J]. (See |28| for an analysis of these 
trade-offs and a method that combines benefits from both.) 

Consider the stepsize pb) and a gradient vector gb) at iteration i. The fcth element of 
pb) is 


T + 



where, in adaGrad, s is the gradient vector squared, summed over all times steps since the 
start of the optimization. Instead, we limit this to the past ten iterations and compute s as 


D b) 


= 9k 


2 (i—10) 


2b—9) 
' 9 k 


2 b) 
' 9k 


(In practice, we implement this recursively to save memory.) We set r/ = 0.1 and r = 1 as 
the default values we use in Stan. 


F Linear Regression with Automatic Relevance Deter¬ 
mination 


Linear regression with automatic relevance determination (ard) is a high-dimensional sparse 
regression model [16] [29]. We describe the model below. Stan code is in Figure [6] 

The inputs are X = x 1: jv where each x„ is D-dimensional. The outputs are y = yi-,N 
where each y n is 1-dimensional. The weights vector w is D-dimensional. The likelihood 

N 

Pi y I x > w, r) = IJ N (y n | w T x„ , t -1 ) 

n=l 

describes measurements corrupted by iid Gaussian noise with unknown variance r -1 . 

The ARD prior and hyper-prior structure is as follows 


p(w, r, a) = p( w, r | a)p(a) 


D 

= JV (w I 0 , (rdiagfa]) -1 ) Gam(r | a 0l b 0 ) Gam(aj | c 0 , d 0 ) 

i= 1 


where a is a D-dimensional hyper-prior on the weights, where each component gets its own 
independent Gamma prior. 

We simulate data such that only half the regressions have predictive power. The results 
in Figure |4a| use ao = = cq = do = 1 as hyper-parameters for the Gamma priors. 


G Hierarchical Logistic Regression 

Hierarchical logistic regression models dependencies in an intuitive and powerful way. We 
study a model of voting preferences from the 1988 United States presidential election. Chap¬ 
ter 14.1 of [23] motivates the model and explains the dataset. We also describe the model 
below. Stan code is in Figure [7J based on [4]. 

Pr(y n = 1) = a (V + ^ female • female„ + /3 bIack • black„ + /3 female - black • female.black„ 

+ a k[n] + a l[n] + a k[n],l[n\ + a j[n] J 
^state \r /region , nv.prev , T \ 

a j ~ -N + P ■ v.prevj , a state j 
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where cr(-) is the sigmoid function (also know as the logistic function). 
The hierarchical variables are 


rv age - 
a k 

~A7(0, 

2 ' 
^age. 

) for k = 1,. 

..,K 


,, edu 
<*l r 

-A7(0, 

2 

°edu 

i—l 

II 

Sh 

..,L 


age.edu 

a k,l 

0, 

2 

°age. 

edu) for k = 

1,. 

= 1,. 

..,L 

region r 

~A7(0, 

^region) for m = 

1,- 

■ ■ ,M. 



The variance terms all have uniform hyper-priors, constrained between 0 and 100. 


H Non-negative Matrix Factorization: Constrained Gamma 
Poisson Model 

The Gamma Poisson factorization model is a powerful way to analyze discrete data matrices 

mm- 

Consider a U x I matrix of observations. We find it helpful to think of u = {1, • • • , 17} as 
users and i = {1, • • • ,/} as items, as in a recommendation system setting. The generative 
process for a Gamma Poisson model with K factors is 

1. For each user u in {1, • • ■ ,17}: 

• For each component k, draw 9 u k ~ Gam(ao, b 0 ). 

2. For each item i in {1, • • • ,/}: 

• For each component k, draw d,:* ~ Gam(co, do). 

3. For each user and item: 

• Draw the observation y U i ~ Poisson 

A potential downfall of this model is that it is not uniquely identifiable: scaling 9 U by 
a and /% by a -1 gives the same likelihood. One way to contend with this is to constrain 
either vector to be a positive, ordered vector during inference. We constrain each 9 U vector 
in our model in this fashion. Stan code is in Figure [8] We set I\ = 10 and all the Gamma 
hyper-parameters to 1 in our experiments. 


I Non-negative Matrix Factorization: Dirichlet Expo¬ 
nential Model 

Another model for discrete data is a Dirichlet Exponential model. The Dirichlet enforces 
uniqueness while the exponential promotes sparsity. This is a non-conjugate model that 
does not appear to have been studied in the literature. 

The generative process for a Dirichlet Exponential model with K factors is 

1. For each user u in {1, • ■ ■ ,17}: 

• Draw the A'-vector 9 U ~ Dir(ao)- 

2. For each item i in {1, • • • , /}: 

• For each component k, draw ~ Exponential(Ao). 

3. For each user and item: 

• Draw the observation y U i ~ Poisson(d l }/3i). 
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Stan code is in Figure [9j We set K = 10, cto = 1000 for each component, and Aq = 
0.1. With this configuration of hyper-parameters, the factors /3» are sparse and appear 
interpretable. 

J Gaussian Mixture Model 

The Gaussian mixture model (gmm) is a powerful probability model. We use it to group a 
dataset of natural images based on their color histograms. We build a high-dimensional GMM 
with a Gaussian prior for the mixture means, a lognormal prior for the mixture standard 
deviations, and a Dirichlet prior for the mixture components. 

The images are in X = Xi,jv where each x„ is D-dimensional and there are N observa¬ 
tions. The likelihood for the images is 


N K D 

p(X | 8,/j,, a) = nn @k A/* {x n d | Hkdi &kd) 

n—1k— 1 d—1 

with a Dirichlet prior for the mixture proportions 

p(0) = Dir(0; a 0 ), 

a Gaussian prior for the mixture means 


D D 

p(v)= n 

k =1 d= 1 

and a lognormal prior for the mixture standard deviations 

D D 

p(a) = nn logNormal(cr fcd ; 0, a a ) 

k—1 d—1 


The dimension of the color histograms in the imageCLEF dataset is D = 576. These a 
concatenation of three 192-length histograms, one for each color channel (red, green, blue) 
of the images. 

We scale the image histograms to have zero mean and unit variance and set ao = 10 000, 
= 0.1 and ay,. ADVI code is in Figure 10 The stochastic data subsampling version of the 
code is in Figure ED 
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data {. 

int < lower =0> N 
int < lower =0> D 
matrix [N,D] x 
vector [N] y 


// number of data items 
// dimension of input features 
// input matrix 
// output vector 


// hyperparameters for Gamma priors 

real <lower=0> aO ; 

real <lower=0> bO ; 

real <lower=0> cO ; 

real <lower=0> dO ; 


parameters { 

vector [D] w; // 

real <lower=0> sigma2 ; // 

vector <lower=0>[D] alpha; // 

> 


weights (coefficients) vector 
variance 

hyper - parameters on weights 


transformed parameters { 

real sigma; // standard deviation 

vector [D] one_over_sqrt_alpha ; // numerical stability 

sigma <- sqrt ( sigma2 ) ; 
for (i in 1 :D) { 

one_over_sqrt_alpha [ i ] <- 1 / sqrt ( alpha [ i ]) ; 

> 


model -( 

// alpha: hyper-prior on weights 
alpha ~ gamma( cO , dO) ; 

// sigma2 : prior on variance 
sigma2 ~ inv_gamma ( aO , bO ) ; 

// w: prior on weights 

w ~ normal (0, sigma * one_over_sqrt_alpha); 

// y: likelihood 
y ~ normal(x * w, sigma); 


Figure 6: Stan code for Linear Regression with Automatic Relevance Determination. 
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data {. 

int <lower=0> N; 

int <lower=0> n_age; 

int <lower=0> n_age_edu ; 

int <lower=0> n_edu ; 

int <lower=0> n_region_full ; 

int <lower=0> n_state ; 

int <lower =0,upper=n _age> age [N] ; 

int < lower =0,upper=n_age_edu> age_edu [N] ; 

vector < lower =0, upper = 1>[N] black ; 

int <lower =0,upper=n_edu> edu[N]; 

vector <lower =0,upper = l>[N] female ; 

int <lower=0,upper=n _region_full > region_full [N] ; 

int <lower=0,upper=n_state > state [N] ; 

vector [N] v_prev_full ; 

int <lower =0,upper = l> y [N] ; 

> 

parameters { 

vector [ n_age] a; 

vector[ n_edu] b; 

vector [n_age_edu] c; 

vector [n_state] d; 

vector [n_region_full] e; 

vector [5] beta; 

real <lower =0,upper = 100> sigma_a; 

real <lower =0,upper = 100> sigma_b ; 

real <lower =0,upper = 100> sigma_c; 

real <lower =0,upper = 100> sigma_d ; 
real <lower =0,upper = 100> sigma_e; 

> 

transformed parameters { 
vector [N] y_hat; 


> 


for (i in 1 :N) 

y_hat [ i ] <- beta[l] 

+ beta[2] * 
+ beta[3] * 
+ beta[5] * 
+ beta[4] * 
+ a[age[ i ] ] 
+ b[edu [ i ] ] 


black [ i ] 
female [ i ] 

female [ i ] * black [ i ] 
v_prev_full [ i ] 


+ c [ age_edu [ i ] ] 

+ d [ st at e [ i ] ] 

+ e [ region_full [ i ] ] ; 


model { 


} 


b - 
c ~ 
d - 
e ~ 
beta 

y ~ 


normal (0 , sigma_a) ; 

normal (0 , sigma_b) ; 

normal (0 , sigma_c) ; 

normal (0 , sigma_d) ; 

normal (0, sigma_e); 

~ normal (0 , 100) ; 

bernoulli_logit (y_hat) ; 


Figure 7: 


Stan code for Hierarchical Logistic Regression, from fi]. 
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data {. 

int <lower=0> U; 
int <lower=0> I ; 
int <lower=0> K; 
int <lower=0> y [U, I ] ; 
real <lower=0> a; 
real <lower=0> b; 
real <lower=0> c; 
real <lower=0> d; 

> 

parameters { 

positive_ordered [K] theta [U] ; // user preference 

vector <lower=0>[K] beta [ I ] ; // item attributes 


model { 

for (u in 1:U) 

theta [u] ~ gammaCa, b) ; // componentwise gamma 
for (i in 1:1) 

beta [ i ] ~ gamma(c, d) ; // componentwise gamma 

for (u in 1:U) { 
for (i in 1:1) { 

increment_log_prob ( 

poisson_log( y[u,i], theta[u] ‘* beta[i]) ); 

> 

> 


Figure 8: Stan code for Gamma Poisson non-negative matrix factorization model. 


data { 

int <lower=0> U; 

int <lower=0> I ; 

int <lower=0> K; 

int <lower=0> y[U,I]; 

real <lower=0> lambdaO ; 

real <lower=0> alphaO ; 


transformed data { 

vector <lower=0>[K] alphaO_vec ; 

for (k in 1:K) { 

alphaO_vec[k] <- alphaO; 

> 

> 

parameters { 

simplex [K] theta [U] ; // user preference 

vector <lower=0>[K] beta [ I ] ; // item attributes 


model { 

for (u in 1:U) 

theta [u] ~ dirichlet (alphaO_vec); // componentwise dirichlet 

for (i in 1:1) 

beta [ i ] ~ exponential (lambdaO) ; // componentwise gamma 

for (u in 1:U) { 
for (i in 1:1) { 

increment_log_prob ( 

poisson_log( y[u,i], theta[u] ‘* beta[i]) ); 

> 

> 


Figure 9: Stan code for Dirichlet Exponential non-negative matrix factorization model. 
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data { 

int <lower=0> N; // number of data points in entire dataset 
int <lower=0> K; // number of mixture components 
int <lower=0> D; // dimension 
vector [D] y[N] ; // observations 

real <lower=0> alphaO ; // dirichlet prior 

real <lower=0> mu_sigmaO ; // means prior 

real <lower=0> sigma_sigmaO; // variances prior 


transformed data { 
vector <lower=0>[K] 
for (k in 1:K) {. 
alphaO_vec [k] <- 

> 

> 

parameters { 

simplex [K] theta ; 
vector [D] mu[K] ; 
vector <lower=0>[D] 


model { 

// priors 

theta ~ dirichlet ( alphaO_vec); 

for (k in 1:K) {. 

mu[k] ~ normal (0.0, mu_sigmaO); 

sigma [k] ~ lognormal ( 0.0 , sigma_sigmaO); 

> 

// likelihood 
for (n in 1:N) { 
real ps[K]; 
for (k in 1:K) { 

ps [k] <- log ( theta [k] ) + normal_log(y[n] , mu[k] , sigma[k]) ; 

> 

increment_log_prob (log_sum_exp ( ps ) ) ; 

> 

> 


alphaO_vec ; 

alphaO ; 


// mixing proportions 
// locations of mixture components 
sigma [K] ; // standard deviations of mixture components 


Figure 10: ADVI Stan code for the GMM example. 
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data { 

real <lower=0> N; // number of data points in entire dataset 
int <lower=0> S_in_minibatch ; 


int <lower=0> K; // number of mixture components 
int <lower=0> D; // dimension 


vector [D] y[S_in_minibatch ] ; 


// observations 


real <lower=0> alphaO ; 

real <lower=0> mu_sigmaO ; 

real <lower=0> sigma_sigmaO; 


// dirichlet prior 
// means prior 
// variances prior 


transformed data { 

real SVI_factor ; 

vector <lower=0>[K] alphaO_vec ; 

for (k in 1:K) {. 

alphaO_vec[k] <- alphaO; 

> 

SVI_factor <- N / S_in_minibatch ; 


parameters { 

simplex [K] theta ; 

vector [D] mu[K] ; 

vector <lower=0>[D] sigma [K] 


model { 

// priors 

theta ~ dirichlet ( alphaO_vec); 

for (k in 1:K) { 

mu[k] ~ normal (0.0, mu_sigmaO); 

sigma [k] ~ lognormal ( 0.0 , sigma_sigmaO); 

> 


// mixing proportions 

// locations of mixture components 

// standard deviations of mixture components 


// likelihood 

for (n in 1: S_in_minibatch) { 

real ps[K]; 
for (k in 1:K) { 

ps [k] <- log ( theta [k] ) + normal_log(y[n] , mu[k] , sigma[k]) ; 

> 

increment_log_prob (log_sum_exp ( ps ) ) ; 

> 

increment_log_prob (log ( SVI_factor ) ) ; 


Figure 11: ADVI Stan code for the GMM example, with stochastic subsampling of the 
dataset. 
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